Abstract

Empirical evidence suggests that hashing is an 
effective strategy for dimensionality reduction 
and practical nonparametric estimation. In this 
paper we provide exponential tail bounds for fea- 
ture hashing and show that the interaction be- 
tween random subspaces is negligible with high 
probability. We demonstrate the feasibility of 
this approach with experimental results for a new 
use case — multitask learning with hundreds of 
thousands of tasks. 



1. Introduction 

Kernel methods use inner products as the basic tool for 
comparisons between objects. That is, given objects
$x_1, \ldots, x_n \in \mathcal{X}$ for some domain $\mathcal{X}$, they rely on

$k(x_i, x_j) := \langle \phi(x_i), \phi(x_j) \rangle$    (1)

to compare the features $\phi(x_i)$ of $x_i$ and $\phi(x_j)$ of $x_j$ respectively.

Eq. (1) is famously referred to as the kernel-trick. It
allows the use of inner products between very high dimensional
feature vectors $\phi(x_i)$ and $\phi(x_j)$ implicitly through
the definition of a positive semi-definite kernel matrix $k$
without ever having to compute a vector $\phi(x_i)$ directly.
This can be particularly powerful in classification settings
where the original input representation has a non-linear decision
boundary. Often, linear separability can be achieved
in a high dimensional feature space $\phi(x_i)$.

In practice, for example in text classification, researchers 
frequently encounter the opposite problem: the original input
space is almost linearly separable (often because of the
existence of handcrafted non-linear features), yet, the train- 
ing set may be prohibitively large in size and very high di- 
mensional. In such a case, there is no need to map the input 
vectors into a higher dimensional feature space. Instead, 
limited memory makes storing a kernel matrix infeasible. 

For this common scenario several authors have recently
proposed an alternative, but highly complementary variation
of the kernel-trick, which we refer to as the
hashing-trick: one hashes the high dimensional input vectors
$x$ into a lower dimensional feature space $\mathbb{R}^m$ with
$\phi : \mathcal{X} \to \mathbb{R}^m$ (Langford et al., 2007; Shi et al., 2009). The
parameter vector of a classifier can therefore live in $\mathbb{R}^m$
instead of in $\mathbb{R}^n$ with kernel matrices or $\mathbb{R}^d$ in the original
input space, where $m \ll n$ and $m \ll d$. Different
from random projections, the hashing-trick preserves sparsity
and introduces no additional overhead to store projection
matrices.

To our knowledge, we are the first to provide exponential 
tail bounds on the canonical distortion of these hashed inner 
products. We also show that the hashing-trick can be partic- 
ularly powerful in multi-task learning scenarios where the 
original feature spaces are the cross-product of the data, X, 
and the set of tasks, U. We show that one can use different
hash functions for each task $\phi_1, \ldots, \phi_{|U|}$ to map the data
into one joint space with little interference.

While many potential applications exist for the hashing- 
trick, as a particular case study we focus on collaborative 
email spam filtering. In this scenario, hundreds of thou- 
sands of users collectively label emails as spam or not- 
spam, and each user expects a personalized classifier that 
reflects their particular preferences. Here, the set of tasks,
U, is the set of email users (this can be very large for
open systems such as Yahoo Mail™ or Gmail™), and the
feature space spans the union of vocabularies in multitudes
of languages.

This paper makes four main contributions: 1. In sec- 
tion 2 we introduce specialized hash functions with unbi- 
ased inner-products that are directly applicable to a large 
variety of kernel-methods. 2. In section 3 we provide
exponential tail bounds that help explain why hashed feature
vectors have repeatedly led to, at times surprisingly,
strong empirical results. 3. Also in section 3 we show that
the interference between independently hashed subspaces 
is negligible with high probability, which allows large-scale 
multi-task learning in a very compressed space. 4. In sec- 
tion 5 we introduce collaborative email-spam filtering as a 
novel application for hash representations and provide ex- 
perimental results on large-scale real-world spam data sets. 

2. Hash Functions 

We introduce a variant on the hash kernel proposed by (Shi 
et al., 2009). This scheme is modified through the introduc- 
tion of a signed sum of hashed features whereas the original 
hash kernels use an unsigned sum. This modification leads 
to an unbiased estimate, which we demonstrate and further 
utilize in the following section. 

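Definition 1 Denote by $h$ a hash function $h : \mathbb{N} \to \{1, \ldots, m\}$.
Moreover, denote by $\xi$ a hash function $\xi : \mathbb{N} \to \{\pm 1\}$.
Then for vectors $x, x' \in \ell_2$ we define the
hashed feature map $\phi$ and the corresponding inner product
as

$\phi_i^{(h,\xi)}(x) = \sum_{j : h(j) = i} \xi(j) x_j$    (2)

and $\langle x, x' \rangle_\phi := \langle \phi^{(h,\xi)}(x), \phi^{(h,\xi)}(x') \rangle$.    (3)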



present paper continues where (Shi et al., 2009) falls short: 
we prove exponential tail bounds. These bounds hold for 
general hash kernels, which we later apply to show how 
hashing enables us to do large-scale multitask learning ef- 
ficiently. We start with a simple lemma about the bias and 
variance of the hash kernel. The proof of this lemma ap- 
pears in appendix A. 

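Lemma 2 The hash kernel is unbiased, that is
$E_\phi[\langle x, x' \rangle_\phi] = \langle x, x' \rangle$. Moreover, the variance is

$\sigma^2_{x,x'} = \frac{1}{m} \left( \sum_{i \neq j} x_i^2 x_j'^2 + x_i x'_i x_j x'_j \right)$,

and thus, for $\|x\|_2 = \|x'\|_2 = 1$, $\sigma^2_{x,x'} = O\left(\frac{1}{m}\right)$.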



This suggests that typical values of the hash kernel should
be concentrated within $O(\frac{1}{\sqrt{m}})$ of the target value. We use
Chebyshev's inequality to show that half of all observations
are within a range of $\sqrt{2}\sigma$. This, together with an indirect
application of Talagrand's convex distance inequality via 
the result of (Liberty et al., 2008), enables us to construct 
exponential tail bounds. 
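The unbiasedness and variance claims of Lemma 2 can be checked exactly on a tiny instance. The following is a minimal sketch, not part of the original analysis: it assumes that $h$ and $\xi$ are drawn uniformly and independently per coordinate, and it enumerates every possible pair $(h, \xi)$ for illustrative values $d = 3$ and $m = 2$ (0-based buckets). All function names are hypothetical.

```python
from itertools import product


def hashed_inner(x, y, h, xi, m):
    """<phi(x), phi(y)> for explicit tables h (bucket per index) and xi (sign per index)."""
    px, py = [0.0] * m, [0.0] * m
    for j, (xj, yj) in enumerate(zip(x, y)):
        px[h[j]] += xi[j] * xj
        py[h[j]] += xi[j] * yj
    return sum(a * b for a, b in zip(px, py))


def exact_moments(x, y, m):
    """Mean and variance of the hashed inner product over all (h, xi), equally weighted."""
    d = len(x)
    values = [
        hashed_inner(x, y, h, xi, m)
        for h in product(range(m), repeat=d)      # all bucket assignments
        for xi in product((-1, 1), repeat=d)      # all sign assignments
    ]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def lemma2_variance(x, y, m):
    """Closed form: (1/m) * sum_{i != j} (x_i^2 y_j^2 + x_i y_i x_j y_j)."""
    d = len(x)
    total = sum(
        x[i] ** 2 * y[j] ** 2 + x[i] * y[i] * x[j] * y[j]
        for i in range(d)
        for j in range(d)
        if i != j
    )
    return total / m


# Illustrative vectors (assumed values, chosen only for the check).
x = [1.0, 2.0, -1.0]
y = [0.5, -1.0, 3.0]
mean, var = exact_moments(x, y, m=2)
```

Here `mean` should agree with the plain inner product of `x` and `y`, and `var` with `lemma2_variance(x, y, 2)`. Exhaustive enumeration is only feasible for very small $d$ and $m$; it serves as a sanity check rather than a practical procedure.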

3.1. Concentration of Measure Bounds 

In this subsection we show that under a hashed feature-map 
the length of each vector is preserved with high probability. 
Talagrand's inequality (Ledoux, 2001) is a key tool for the 
proof of the following theorem (detailed in the appendix B). 

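Theorem 3 Let $\epsilon < 1$ be a fixed constant and $x$ be a given
instance such that $\|x\|_2 = 1$. If $m \ge 72 \log(1/\delta)/\epsilon^2$ and

$\|x\|_\infty \le \frac{\epsilon}{18 \sqrt{\log(1/\delta) \log(m/\delta)}}$

we have that

$\Pr\left[ \left| \|x\|_\phi^2 - 1 \right| \ge \epsilon \right] \le 2\delta.$    (4)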



Although the hash functions in definition 1 are defined over 
the natural numbers N, in practice we often consider hash 
functions over arbitrary strings. These are equivalent, since 
each finite-length string can be represented by a unique nat- 
ural number. 

Usually, we abbreviate the notation $\phi^{(h,\xi)}(\cdot)$ by just $\phi(\cdot)$.
Two hash functions $\phi$ and $\phi'$ are different when $\phi = \phi^{(h,\xi)}$
and $\phi' = \phi^{(h',\xi')}$ such that either $h' \neq h$ or $\xi \neq \xi'$. The
purpose of the binary hash $\xi$ is to remove the bias inherent
in the hash kernel of (Shi et al., 2009).

In a multi-task setting, we obtain instances in combination
with tasks, $(x, u) \in \mathcal{X} \times U$. We can naturally extend our
definition 1 to hash pairs, and will write $\phi_u(x) = \phi((x, u))$.
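The signed feature map of Definition 1, and its extension to (instance, task) pairs, can be sketched as follows. This is an illustrative sketch under stated assumptions: features are keyed by strings, `blake2b` from the standard library stands in for the hash functions $h$ and $\xi$ (the paper does not prescribe a particular hash), buckets are 0-based, and the task is attached by prefixing the key. All names are hypothetical.

```python
import hashlib


def _digest(key: str, salt: str) -> int:
    """Deterministic 64-bit integer derived from a key and a salt (assumed hash choice)."""
    data = (salt + "\x00" + key).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def h(key: str, m: int) -> int:
    """Bucket hash h: key -> {0, ..., m-1} (0-based; the text uses 1..m)."""
    return _digest(key, "bucket") % m


def xi(key: str) -> int:
    """Sign hash xi: key -> {+1, -1}."""
    return 1 if _digest(key, "sign") & 1 else -1


def hashed_features(x: dict, m: int, task: str = "") -> list:
    """phi_i(x) = sum over keys j with h(j) = i of xi(j) * x_j.

    If `task` is non-empty, the pair (task, key) is hashed instead of the key,
    which gives a task-specific map phi_u.
    """
    phi = [0.0] * m
    for key, value in x.items():
        k = task + "\x1f" + key if task else key
        phi[h(k, m)] += xi(k) * value
    return phi


def hashed_inner(x: dict, x2: dict, m: int, task: str = "") -> float:
    """<x, x'>_phi := <phi(x), phi(x')>."""
    a = hashed_features(x, m, task)
    b = hashed_features(x2, m, task)
    return sum(p * q for p, q in zip(a, b))
```

For example, `hashed_features({"cheap": 1.0, "pills": 2.0}, 16)` returns a length-16 vector, and passing `task="u1"` maps the same tokens through a different effective hash function. Only the nonzero entries of `x` are visited, so sparsity is preserved.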

3. Analysis 

The following section is dedicated to theoretical analysis 
of hash kernels and their applications. In this sense, the 



Note that an analogous result would also hold for the orig- 
inal hash kernel of (Shi et al., 2009), the only modifica- 
tion being the associated bias terms. The above result can 
also be utilized to show a concentration bound on the inner 
product between two general vectors x and x'. 

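Corollary 4 For two vectors $x$ and $x'$, let us define

$\sigma := \max(\sigma_{x,x}, \sigma_{x',x'}, \sigma_{x-x',x-x'})$

$\eta := \max\left( \frac{\|x\|_\infty}{\|x\|_2}, \frac{\|x'\|_\infty}{\|x'\|_2}, \frac{\|x - x'\|_\infty}{\|x - x'\|_2} \right)$.

Also let $\Delta = \|x\|_2^2 + \|x'\|_2^2 + \|x - x'\|_2^2$. If $m \ge \Omega(\frac{1}{\epsilon^2} \log(1/\delta))$ and $\eta = O(\frac{\epsilon}{\log(m/\delta)})$,
then we have that

$\Pr\left[ \left| \langle x, x' \rangle_\phi - \langle x, x' \rangle \right| > \epsilon \Delta / 2 \right] < \delta.$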

The proof for this corollary can be found in appendix C. We 
can also extend the bound in Theorem 3 for the maximal 
canonical distortion over large sets of distances between
vectors as follows: 

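Corollary 5 Let $m \ge \Omega(\frac{1}{\epsilon^2} \log(n/\delta))$ and $\eta = O(\frac{\epsilon}{\log(m/\delta)})$.
Denote by $X = \{x_1, \ldots, x_n\}$ a set of vectors
which satisfy $\|x_i - x_j\|_\infty \le \eta \|x_i - x_j\|_2$ for all pairs $i, j$.
In this case with probability $1 - \delta$ we have for all $i, j$

$\frac{\left| \|x_i - x_j\|_\phi^2 - \|x_i - x_j\|_2^2 \right|}{\|x_i - x_j\|_2^2} \le \epsilon.$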



This means that the number of observations n (or corre- 
spondingly the size of the un-hashed kernel matrix) only 
enters logarithmically in the analysis. 

Proof We apply the bound of Theorem 3 to each distance
individually. Note that each vector $x_i - x_j$ satisfies the
conditions of the theorem, and hence for each vector $x_i - x_j$,
we preserve the distance up to a factor of $(1 \pm \epsilon)$ with
probability $1 - \frac{\delta}{n^2}$. Taking the union bound over all pairs
gives us the result.



3.2. Multiple Hashing 

Note that the tightness of the union bound in Corollary 5
depends crucially on the magnitude of $\eta$. In other words,
for large values of $\eta$, that is, whenever some terms in $x$
are very large, even a single collision can already lead to
significant distortions of the embedding. This issue can
be amended by trading off sparsity with variance. A vector
of unit length may be written as $(1, 0, 0, 0, \ldots)$, or
as $(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0, \ldots)$, or more generally as a vector with $c$
nonzero terms of magnitude $c^{-1/2}$. This is relevant, for
instance, whenever the magnitudes of $x$ follow a known pattern,
e.g. when representing documents as bags of words,
since we may simply hash frequent words several times.
The following lemma gives an intuition as to how the
confidence bounds scale in terms of the replications:

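Lemma 6 If we let $x' = \frac{1}{\sqrt{c}}(x, \ldots, x)$ then:

1. It is norm preserving: $\|x\|_2 = \|x'\|_2$.

2. It reduces component magnitude by $\frac{1}{\sqrt{c}} = \frac{\|x'\|_\infty}{\|x\|_\infty}$.

3. Variance increases to $\sigma^2_{x',x'} = \frac{1}{c} \sigma^2_{x,x} + \frac{c-1}{cm} 2 \|x\|_2^4$.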

Applying Lemma 6 to Theorem 3, a large magnitude can 
be decreased at the cost of an increased variance. 
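The first two properties of the replication in Lemma 6 are easy to check numerically. The sketch below is illustrative only: it builds $x'$ as $c$ concatenated copies of $x$ scaled by $1/\sqrt{c}$ on a dense list, and the helper names are hypothetical. In a hashed implementation each copy would need its own key (for instance the copy index appended to the feature name, which is an assumption here) so that the copies land in different buckets.

```python
import math


def replicate(x, c):
    """Return x' = (x, ..., x) / sqrt(c): c scaled copies of x, concatenated."""
    scale = 1.0 / math.sqrt(c)
    return [scale * v for _ in range(c) for v in x]


def l2_norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(t * t for t in v))


def linf_norm(v):
    """Largest absolute component."""
    return max(abs(t) for t in v)
```

For the unit vector `[1.0, 0.0, 0.0, 0.0]` and `c = 4`, the replicated vector has 16 entries, the same Euclidean norm, and a largest component of magnitude 0.5 instead of 1.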

3.3. Approximate Orthogonality 

For multitask learning, we must learn a different parameter 
vector for each related task. When mapped into the same 



hash-feature space we want to ensure that there is little
interaction between the different parameter vectors. Let $U$ be
a set of different tasks, $u \in U$ being a specific one. Let $w$ be
a combination of the parameter vectors of tasks in $U \setminus \{u\}$.
We show that for any observation $x$ for task $u$, the
interaction of $w$ with $x$ in the hashed feature space is minimal.
For each $x$, let the image of $x$ under the hash feature-map
for task $u$ be denoted as $\phi_u(x) = \phi^{(\xi,h)}((x, u))$.

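Theorem 7 Let $w \in \mathbb{R}^m$ be a parameter vector for tasks
in $U \setminus \{u\}$. In this case the value of the inner product
$\langle w, \phi_u(x) \rangle$ is bounded by

$\Pr\left\{ |\langle w, \phi_u(x) \rangle| > \epsilon \right\} \le 2 \exp\left( - \frac{\epsilon^2/2}{m^{-1} \|w\|_2^2 \|x\|_2^2 + \epsilon \|w\|_\infty \|x\|_\infty / 3} \right).$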



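Proof We use Bernstein's inequality (Bernstein, 1946),
which states that for independent random variables $X_j$,
with $E[X_j] = 0$, if $C > 0$ is such that $|X_j| \le C$, then

$\Pr\left[ \sum_{j=1}^n X_j > t \right] \le \exp\left( - \frac{t^2/2}{\sum_{j=1}^n E[X_j^2] + Ct/3} \right).$    (5)

We have to compute the concentration property of
$\langle w, \phi_u(x) \rangle = \sum_j x_j \xi(j) w_{h(j)}$. Let $X_j = x_j \xi(j) w_{h(j)}$.
By the definition of $h$ and $\xi$, $X_j$ are independent. Also,
for each $j$, since $w$ depends only on the hash-functions for
$U \setminus \{u\}$, $w_{h(j)}$ is independent of $\xi(j)$. Thus,
$E[X_j] = E_{(\xi,h)}[x_j \xi(j) w_{h(j)}] = 0$. For each $j$, we also have
$|X_j| \le \|x\|_\infty \|w\|_\infty =: C$. Finally, $\sum_j E[X_j^2]$ is given by

$E\left[ \sum_j (x_j \xi(j) w_{h(j)})^2 \right] = \frac{1}{m} \sum_{j,\ell} x_j^2 w_\ell^2 = \frac{1}{m} \|x\|_2^2 \|w\|_2^2.$

The claim follows by plugging both terms and $C$ into the
Bernstein inequality (5). ■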



Theorem 7 bounds the influence of unrelated tasks with any 
particular instance. In section 5 we demonstrate the real- 
world applicability with empirical results on a large-scale 
multi-task learning problem. 
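To get a feel for how the interference bound behaves, one can evaluate the Bernstein-type expression $2 \exp\left(-\frac{\epsilon^2/2}{m^{-1}\|w\|_2^2\|x\|_2^2 + \epsilon\|w\|_\infty\|x\|_\infty/3}\right)$ for different table sizes $m$. The function below is a small sketch; its name, its argument order, and the numeric values in the usage note are assumptions made for illustration, not values from the experiments.

```python
import math


def interference_bound(w_l2, w_linf, x_l2, x_linf, m, eps):
    """Upper bound on Pr{|<w, phi_u(x)>| > eps}, given the norms of w and x.

    Assumes all norms are positive; the result is capped at 1 because it
    bounds a probability.
    """
    denom = (w_l2 ** 2) * (x_l2 ** 2) / m + eps * w_linf * x_linf / 3.0
    return min(1.0, 2.0 * math.exp(-(eps ** 2 / 2.0) / denom))
```

With unit Euclidean norms, `w_linf = 0.01`, `x_linf = 0.1` and `eps = 0.1`, increasing `m` from `2**10` to `2**20` shrinks the bound sharply, which matches the intuition that a larger hash table leaves less room for interference. Once the first term in the denominator is negligible, the bound is governed by the max-norm term and no longer improves with `m`.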

4. Applications 

The advantage of feature hashing is that it allows for significant
storage compression for parameter vectors: storing
$w$ in the raw feature space naively requires $O(d)$ numbers,
when $w \in \mathbb{R}^d$. By hashing, we are able to reduce this to
$O(m)$ numbers while avoiding costly matrix-vector multiplications
common in Locality Sensitive Hashing. In addition,
the sparsity of the resulting vector is preserved.

The benefits of the hashing-trick lead to applications in
almost all areas of machine learning and beyond. In particular,
feature hashing is extremely useful whenever large
numbers of parameters with redundancies need to be stored
within bounded memory capacity.

Personalization One powerful application of feature 
hashing is found in multitask learning. Theorem 7 allows 
us to hash multiple classifiers for different tasks into one 
feature space with little interaction. To illustrate, we ex- 
plore this setting in the context of spam-classifier personal- 
ization. 

Suppose we have thousands of users $U$ and want to perform
related but not identical classification tasks for each
of them. Users provide labeled data by marking emails
as spam or not-spam. Ideally, for each user $u \in U$, we
want to learn a predictor $w_u$ based solely on the data of that
user. However, webmail users are notoriously lazy in labeling
emails, and even those that do not contribute to the
training data expect a working spam filter. Therefore, we
also need to learn an additional global predictor $w_0$ to allow
data sharing amongst all users.

Storing all predictors $w_i$ requires $O(d \times (|U| + 1))$ memory.
In a task like collaborative spam-filtering, $|U|$, the
number of users, can be in the hundreds of thousands and
the size of the vocabulary is usually in the order of millions.
The naive way of dealing with this is to eliminate
all infrequent tokens. However, spammers target this
memory-vulnerability by maliciously misspelling words
and thereby creating highly infrequent but spam-typical
tokens that "fall under the radar" of conventional classifiers.
Instead, if all words are hashed into a finite-sized
feature vector, infrequent but class-indicative tokens get a
chance to contribute to the classification outcome. Further,
large scale spam-filters (e.g. Yahoo Mail™ or GMail™)
typically have severe memory and time constraints, since
they have to handle billions of emails per day. To guarantee
a finite-size memory footprint we hash all weight vectors
$w_0, \ldots, w_{|U|}$ into a joint, significantly smaller, feature
space $\mathbb{R}^m$ with different hash functions $\phi_0, \ldots, \phi_{|U|}$. The
resulting hashed-weight vector $w_h \in \mathbb{R}^m$ can then be written
as:

$w_h = \phi_0(w_0) + \sum_{u \in U} \phi_u(w_u).$    (6)

Note that in practice the weight vector $w_h$ can be learned
directly in the hashed space. All un-hashed weight vectors
never need to be computed. Given a new document/email
$x$ of user $u \in U$, the prediction task now consists of calculating
$\langle \phi_0(x) + \phi_u(x), w_h \rangle$. Due to hashing we have two
sources of error: distortion $\epsilon_d$ of the hashed inner products
and the interference with other hashed weight vectors $\epsilon_i$.
More precisely:

$\langle \phi_0(x) + \phi_u(x), w_h \rangle = \langle x, w_0 + w_u \rangle + \epsilon_d + \epsilon_i.$    (7)

The interference error consists of all collisions between
$\phi_0(x)$ or $\phi_u(x)$ with hash functions of other users,

$\epsilon_i = \sum_{v \in U, v \neq 0} \langle \phi_0(x), \phi_v(w_v) \rangle + \sum_{v \in U, v \neq u} \langle \phi_u(x), \phi_v(w_v) \rangle.$    (8)

To show that $\epsilon_i$ is small with high probability we can
apply Theorem 7 twice, once for each term of (8).
We consider each user's classification to be a separate
task, and since $\sum_{v \in U, v \neq 0} w_v$ is independent of the hash-function
$\phi_0$, the conditions of Theorem 7 apply with $w = \sum_{v \neq 0} w_v$
and we can employ it to bound the first term,
$\sum_{v \in U, v \neq 0} \langle \phi_0(x), \phi_v(w_v) \rangle$. The second application is
identical except that all subscripts "0" are substituted with
"u". For lack of space we do not derive the exact bounds.

The distortion error occurs because each hash function that
is utilized by user $u$ can self-collide:

$\epsilon_d = \sum_{v \in \{u, 0\}} \left| \langle \phi_v(x), \phi_v(w_v) \rangle - \langle x, w_v \rangle \right|.$    (9)

To show that $\epsilon_d$ is small with high probability, we apply
Corollary 4 once for each possible value of $v$.

In section 5 we show experimental results for this setting.
The empirical results are stronger than the theoretical
bounds derived in this subsection: our technique outperforms
a single global classifier on hundreds of thousands of
users. We discuss an intuitive explanation in section 5.

Massively Multiclass Estimation We can also regard 
massively multi-class classification as a multitask problem, 
and apply feature hashing in a way similar to the person- 
alization setting. Instead of using a different hash func- 
tion for each user, we use a different hash function for each 
class. 

(Shi et al., 2009) apply feature hashing to problems with
a high number of categories. They show empirically that
joint hashing of the feature vector $\phi(x, y)$ can be efficiently
achieved for problems with millions of features and thousands
of classes.

Figure 1. The hashed personalization summarized in a schematic
layout. Each token is duplicated and one copy is individualized
(e.g. by concatenating each word with a unique user identifier).
Then, the global hash function maps all tokens into a low dimensional
feature space where the document is classified.

Figure 2. The decrease of uncaught spam over the baseline classifier
averaged over all users. The classification threshold was
chosen to keep the not-spam misclassification fixed at 1%.
The hashed global classifier (global-hashed) converges relatively
soon, showing that the distortion error $\epsilon_d$ vanishes. The personalized
classifier results in an average improvement of up to 30%.

Collaborative Filtering Assume that we are given a very
large sparse matrix $M$ where the entry $M_{ij}$ indicates what
action user $i$ took on instance $j$. A common example for
actions and instances is user-ratings of movies (Bennett &
Lanning, ). A successful method for finding common factors
amongst users and instances for predicting unobserved
actions is to factorize $M$ into $M = U^\top W$. If we have
millions of users performing millions of actions, storing $U$
and $W$ in memory quickly becomes infeasible. Instead, we
may choose to compress the matrices $U$ and $W$ using hashing.
For $U, W \in \mathbb{R}^{n \times d}$ denote by $u, w \in \mathbb{R}^m$ vectors with

$u_i = \sum_{j,k : h(j,k) = i} \xi(j,k) U_{jk}$ and $w_i = \sum_{j,k : h'(j,k) = i} \xi'(j,k) W_{jk}$,

where $(h, \xi)$ and $(h', \xi')$ are independently chosen hash
functions. This allows us to approximate matrix elements
$M_{ij} = [U^\top W]_{ij}$ via

$M^\phi_{ij} := \sum_k \xi(k,i) \xi'(k,j) u_{h(k,i)} w_{h'(k,j)}.$

This gives a compressed vector representation of $M$ that
can be efficiently stored.

5. Results 

We evaluated our algorithm in the setting of personalization.
As data set, we used a proprietary email spam-classification
task of $n = 3.2$ million emails, properly
anonymized, collected from $|U| = 433167$ users. Each
email is labeled as spam or not-spam by one user in $U$. After
tokenization, the data set consists of 40 million unique
words.

For all experiments in this paper, we used the Vowpal Wabbit
implementation$^1$ of stochastic gradient descent on a
square-loss. In the mail-spam literature the misclassification
of not-spam is considered to be much more harmful
than misclassification of spam. We therefore follow the
convention to set the classification threshold during test
time such that exactly 1% of the not-spam test data is
classified as spam. Our implementation of the personalized
hash functions is illustrated in Figure 1. To obtain a personalized
hash function $\phi_u$ for user $u$, we concatenate a unique
user-id to each word in the email and then hash the newly
generated tokens with the same global hash function.
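The personalization scheme described above (duplicate every token, prefix one copy with the user id, hash everything with one global function, learn directly in the hashed space) can be sketched as follows. This is a minimal illustration and not the Vowpal Wabbit implementation: the `blake2b` hash, the `_` separator, the sparse dictionary representation, and the plain squared-loss SGD step are all assumptions, and all names are hypothetical.

```python
import hashlib


def signed_bucket(token: str, bits: int):
    """Map a token to (bucket index in [0, 2**bits), sign in {+1, -1})."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    value = int.from_bytes(digest, "big")
    sign = 1 if value & 1 else -1               # lowest bit chooses the sign
    return (value >> 1) % (1 << bits), sign     # remaining bits choose the bucket


def personalized_tokens(tokens, user_id: str):
    """Emit each token twice: a global copy and a copy prefixed with the user id."""
    out = []
    for t in tokens:
        out.append(t)
        out.append(f"{user_id}_{t}")
    return out


def hash_email(tokens, user_id: str, bits: int) -> dict:
    """Sparse hashed vector phi_0(x) + phi_u(x), stored as {bucket: value}."""
    phi = {}
    for t in personalized_tokens(tokens, user_id):
        i, s = signed_bucket(t, bits)
        phi[i] = phi.get(i, 0.0) + s
    return phi


def predict(phi: dict, w: dict) -> float:
    """Inner product between the hashed email and the hashed weight vector w_h."""
    return sum(v * w.get(i, 0.0) for i, v in phi.items())


def sgd_step(phi: dict, w: dict, label: float, lr: float) -> None:
    """One squared-loss gradient step on w_h, performed in the hashed space."""
    err = predict(phi, w) - label
    for i, v in phi.items():
        w[i] = w.get(i, 0.0) - lr * err * v
```

A call such as `hash_email(["cheap", "pills"], "u42", 22)` yields a sparse vector indexed by 22 bits, and repeated `sgd_step` calls update the single shared weight table for both the global and the personalized copies. The un-hashed weight vectors $w_0$ and $w_u$ are never materialized.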

1 http://hunch.net/~vw/ 



The data set was collected over a span of 14 days. We
used the first 10 days for training and the remaining 4 days
for testing. As baseline, we chose the purely global classifier
trained over all users and hashed into $2^{26}$ dimensional
space. As $2^{26}$ far exceeds the total number of unique words
we can regard the baseline to be representative for the
classification without hashing. All results are reported as the
amount of spam that passed the filter undetected, relative
to this baseline (e.g. a value of 0.80 indicates a 20% reduction
in spam for the user)$^2$.

Figure 2 displays the average amount of spam in users' in- 
boxes as a function of the number of hash keys m, relative 
to the baseline above. In addition to the baseline, we eval- 
uate two different settings. 

The global-hashed curve represents the relative
spam catch-rate of the global classifier after hashing
$\langle \phi_0(w_0), \phi_0(x) \rangle$. At $m = 2^{26}$ this is identical to the
baseline. Early convergence at $m = 2^{22}$ suggests that at
this point hash collisions have no impact on the classification
error and the baseline is indeed equivalent to that
obtainable without hashing.

In the personalized setting each user $u \in U$ gets her own
classifier $\phi_u(w_u)$ as well as the global classifier $\phi_0(w_0)$.
Without hashing the feature space explodes, as the cross
product of $|U| = 400$K users and $n = 40$M tokens results
in 16 trillion possible unique personalized features.
Figure 2 shows that despite aggressive hashing, personalization
results in a 30% spam reduction once the hash table is
indexed by 22 bits.

2 As part of our data sharing agreement, we agreed not to in- 
clude absolute classification error-rates. 
Figure 3. Results for users clustered by training emails. For example,
the bucket [8, 15] consists of all users with eight to fifteen
training emails. Although users in buckets with large amounts of
training data do benefit more from the personalized classifier (up
to 65% reduction in spam), even users that did not contribute to
the training corpus at all obtain almost 20% spam-reduction.



User clustering One hypothesis for the strong results in
Figure 2 is that they originate from the non-uniform distribution
of user votes: it is possible that using personalization and
feature hashing we benefit a small number of users who
have labeled many emails, degrading the performance of
most users (who have labeled few or no emails) in the process.
In fact, in real life, a large fraction of email users do
not contribute at all to the training corpus and only interact
with the classifier during test time. The personalized version
of the test email $\phi_u(x_u)$ is then hashed into buckets
of other tokens and only adds interference noise $\epsilon_i$ to the
classification.

In order to show that we improve the performance of most 
users, it is therefore important that we not only report av- 
eraged results over all emails, but explicitly examine the 
effects of the personalized classifier for users depending 
on their contribution to the training set. To this end, we 
place users into exponentially growing buckets based on 
their number of training emails and compute the relative 
reduction of uncaught spam for each bucket individually. 
Figure 3 shows the results on a per-bucket basis. We do not 
compare against a purely local approach, with no global 
component, since for a large fraction of users — those with- 
out training data — this approach cannot outperform ran- 
dom guessing. 
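The per-bucket evaluation can be sketched as below. The bucket boundaries follow the pattern suggested by the buckets named in the text ([0], [1], [8, 15]), i.e. powers of two; that every bucket follows this pattern without an upper cap, as well as the record layout and function names, are assumptions made for this illustration.

```python
from collections import defaultdict


def bucket_for(n_train: int):
    """Inclusive (low, high) bucket for a user with n_train training emails.

    Buckets grow exponentially: [0], [1], [2, 3], [4, 7], [8, 15], ...
    """
    if n_train < 0:
        raise ValueError("n_train must be non-negative")
    if n_train == 0:
        return (0, 0)
    low = 1 << (n_train.bit_length() - 1)   # largest power of two <= n_train
    return (low, 2 * low - 1)


def relative_uncaught_spam(records):
    """Ratio of uncaught spam (personalized / baseline), aggregated per bucket.

    `records` holds one tuple per user:
    (n_train, uncaught_spam_personalized, uncaught_spam_baseline).
    Buckets whose baseline count is zero are skipped.
    """
    pers = defaultdict(float)
    base = defaultdict(float)
    for n_train, uncaught_personalized, uncaught_baseline in records:
        b = bucket_for(n_train)
        pers[b] += uncaught_personalized
        base[b] += uncaught_baseline
    return {b: pers[b] / base[b] for b in base if base[b] > 0}
```

For instance, a user with 12 training emails falls into bucket `(8, 15)`, and a ratio of 0.80 for a bucket corresponds to a 20% reduction in uncaught spam relative to the baseline.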

It might appear rather surprising that users in the bucket
with no or very few training emails (the line of bucket
[0] is identical to bucket [1]) also benefit from personalization.
After all, their personalized classifier was never
trained and can only add noise at test-time. The classifier 
improvement of this bucket can be explained by the sub- 
jective definition of spam and not-spam. In the personal- 
ized setting the individual component of user labeling is 
absorbed by the local classifiers and the global classifier 



represents the common definition of spam and not-spam. 
In other words, the global part of the personalized classi- 
fier obtains better generalization properties, benefiting all 
users. 

6. Related Work 

A number of researchers have tackled related, albeit differ- 
ent problems. 

(Rahimi & Recht, 2008) use Bochner's theorem and sam- 
pling to obtain approximate inner products for Radial Ba- 
sis Function kernels. (Rahimi & Recht, 2009) extend this 
to sparse approximation of weighted combinations of ba- 
sis functions. This is computationally efficient for many 
function spaces. Note that the representation is dense. 

(Li et al., 2007) take a complementary approach: for sparse
feature vectors, $\phi(x)$, they devise a scheme of reducing the
number of nonzero terms even further. While this is in principle
desirable, it does not resolve the problem of $\phi(x)$ being
high dimensional. More succinctly, it is necessary to
express the function in the dual representation rather than
expressing $f$ as a linear function, where $w$ is unlikely to be
compactly represented: $f(x) = \langle \phi(x), w \rangle$.

(Achlioptas, 2003) provides computationally efficient randomization
schemes for dimensionality reduction. Instead
of performing a dense $d \cdot m$ dimensional matrix vector multiplication
to reduce the dimensionality for a vector of dimensionality
$d$ to one of dimensionality $m$, as is required
by the algorithm of (Gionis et al., 1999), he only requires $\frac{1}{3}$
of that computation by designing a matrix consisting only
of entries $\{-1, 0, 1\}$. Pioneered by (Ailon & Chazelle,
2006), there has been a line of work (Ailon & Liberty, 
2008; Matousek, 2008) on improving the complexity of 
random projection by using various code-matrices in or- 
der to preprocess the input vectors. Some of our theoretical 
bounds are derivable from that of (Liberty et al., 2008). 

A related construction is the CountMin sketch of (Cor- 
mode & Muthukrishnan, 2004) which stores counts in 
a number of replicates of a hash table. This leads to good 
concentration inequalities for range and point queries. 

(Shi et al., 2009) propose a hash kernel to deal with the is- 
sue of computational efficiency by a very simple algorithm: 
high-dimensional vectors are compressed by adding up all 
coordinates which have the same hash value — one only 
needs to perform as many calculations as there are nonzero 
terms in the vector. This is a significant computational sav- 
ing over locality sensitive hashing (Achlioptas, 2003; Gio- 
nis et al., 1999). 

Several additional works provide motivation for the investi- 
gation of hashing representations. For example, (Ganchev 
& Dredze, 2008) provide empirical evidence that the hashing
trick can be used to effectively reduce the memory
footprint on many sparse learning problems by an order of 
magnitude via removal of the dictionary. Our experimen- 
tal results validate this, and show that much more radical 
compression levels are achievable. In addition, (Langford 
et al., 2007) released the Vowpal Wabbit fast online learn- 
ing software which uses a hash representation similar to 
that discussed here. 

7. Conclusion 

In this paper we analyze the hashing-trick for dimensional- 
ity reduction theoretically and empirically. As part of our 
theoretical analysis we introduce unbiased hash functions 
and provide exponential tail bounds for hash kernels. These 
give further insight into hash-spaces and explain previously
made empirical observations. We also derive that random
subspaces of the hashed space are unlikely to interact,
which makes multitask learning with many tasks possible.

Our empirical results validate this on a real-world application
within the context of spam filtering. Here we demonstrate
that even with a very large number of tasks and
features, all mapped into a joint lower dimensional hash-space,
one can obtain impressive classification results with
a finite memory guarantee.
A. Mean and Variance

Proof [Lemma 2] To compute the expectation we expand

$\langle x, x' \rangle_\phi = \sum_{i,j} \xi(i) \xi(j) x_i x'_j \delta_{h(i),h(j)}.$

Since $E_\phi[\langle x, x' \rangle_\phi] = E_h[E_\xi[\langle x, x' \rangle_\phi]]$, taking expectations
over $\xi$ we see that only the terms $i = j$ have nonzero
value, which shows the first claim. For the variance we
compute $E_\phi[\langle x, x' \rangle_\phi^2]$. Expanding this, we get:

$\langle x, x' \rangle_\phi^2 = \sum_{i,j,k,l} \xi(i) \xi(j) \xi(k) \xi(l) x_i x'_j x_k x'_l \delta_{h(i),h(j)} \delta_{h(k),h(l)}.$

This expression can be simplified by noting that:

$E_\xi[\xi(i) \xi(j) \xi(k) \xi(l)] = \delta_{ij} \delta_{kl} + [1 - \delta_{ijkl}](\delta_{ik} \delta_{jl} + \delta_{il} \delta_{jk}).$

Passing the expectation over $\xi$ through the sum, this allows
us to break down the expansion of the variance into two
terms.

$E_\phi[\langle x, x' \rangle_\phi^2] = \sum_{i,k} x_i x'_i x_k x'_k + \sum_{i \neq j} x_i^2 x_j'^2 E_h[\delta_{h(i),h(j)}] + \sum_{i \neq j} x_i x'_i x_j x'_j E_h[\delta_{h(i),h(j)}]$

$= \langle x, x' \rangle^2 + \frac{1}{m} \left( \sum_{i \neq j} x_i^2 x_j'^2 + \sum_{i \neq j} x_i x'_i x_j x'_j \right)$

by noting that $E_h[\delta_{h(i),h(j)}] = \frac{1}{m}$ for $i \neq j$. Using the fact
that $\sigma^2 = E_\phi[\langle x, x' \rangle_\phi^2] - E_\phi[\langle x, x' \rangle_\phi]^2$ proves the claim. ■

B. Concentration of Measure

We use the concentration result derived by Liberty, Ailon
and Singer in (Liberty et al., 2008). Liberty et al. create
a Johnson-Lindenstrauss random projection matrix by
combining a carefully constructed deterministic matrix $A$
with random diagonal matrices. For completeness we
restate the relevant lemma. Let $i$ range over the hash-buckets.
Let $m = c \log(1/\delta)/\epsilon^2$ for a large enough constant
$c$. For a given vector $x$, define the diagonal matrix
$D_x$ as $(D_x)_{jj} = x_j$. For any matrix $A \in \mathbb{R}^{m \times d}$, define
$\|x\|_A = \max_{y : \|y\|_2 = 1} \|A D_x y\|_2$.

Lemma 2 (Liberty et al., 2008). For any column-normalized
matrix $A$, vector $x$ with $\|x\|_2 = 1$ and an
i.i.d. random $\pm 1$ diagonal matrix $D_s$, the following holds:
$\forall x$, if $\|x\|_A \le \frac{\epsilon}{6 \sqrt{\log(1/\delta)}}$ then $\Pr[ | \|A D_s x\|_2 - 1 | > \epsilon ] \le \delta$.

We also need the following form of a weighted balls and
bins inequality. The statement of the Lemma, as well as
the proof, follows that of Lemma 6 (Dasgupta et al., 2010).
We still outline the proof because of some parameter values
being different.

Lemma 8 Let $m$ be the size of the hash function range and
let $\eta = \frac{1}{2 \sqrt{m \log(m/\delta)}}$. If $x$ is such that $\|x\|_2 = 1$ and
$\|x\|_\infty \le \eta$, then define $\sigma_*^2 = \max_i \sum_{j=1}^d x_j^2 \delta_{i h(j)}$ where $i$
ranges over all hash-buckets. We have that with probability
$1 - \delta$,

$\sigma_*^2 \le \frac{2}{m}.$

Proof We outline the proof-steps. Since the buckets
have identical distribution, we look only at the 1st
bucket, i.e. at $i = 1$ and bound $\sum_{j : h(j) = 1} x_j^2$. Define
$X_j = x_j^2 (\delta_{1 h(j)} - \frac{1}{m})$. Then $E_h[X_j] = 0$ and

$E_h[X_j^2] = x_j^4 \frac{1}{m} \left( 1 - \frac{1}{m} \right) \le \frac{x_j^4}{m} \le \frac{x_j^2 \eta^2}{m}$

using $\|x\|_\infty \le \eta$. Thus, $\sum_j E_h[X_j^2] \le \frac{\eta^2}{m}$. Also note that
$\sum_j X_j = \sum_{j : h(j) = 1} x_j^2 - \frac{1}{m}$. Plugging this into the Bernstein's
inequality, equation 5, we have that

$\Pr\left[ \sum_j X_j > \frac{1}{m} \right] \le \exp\left( - \frac{1/2m^2}{\eta^2/m + \eta^2/3m} \right) = \exp\left( - \frac{3}{8 m \eta^2} \right) \le \exp(-\log(m/\delta)) \le \delta/m.$

By taking union bound over all the $m$ buckets, we get the
above result. ■

Proof [Theorem 3] Given the function $\phi = (h, \xi)$, define
the matrix $A$ as $A_{ij} = \delta_{i h(j)}$ and $D_s$ as $(D_s)_{jj} = \xi(j)$. Let
$x$ be as specified, i.e. $\|x\|_2 = 1$ and $\|x\|_\infty \le \eta$. Note that
$\|x\|_\phi = \|A D_s x\|_2$. Let $y \in \mathbb{R}^d$ be such that $\|y\|_2 = 1$.
Thus

$\|A D_x y\|_2^2 = \sum_{i=1}^m \left( \sum_{j=1}^d y_j \delta_{i h(j)} x_j \right)^2$

$\le \sum_{i=1}^m \left( \sum_{j=1}^d y_j^2 \delta_{i h(j)} \right) \left( \sum_{j=1}^d x_j^2 \delta_{i h(j)} \right)$

$\le \sum_{i=1}^m \left( \sum_{j=1}^d y_j^2 \delta_{i h(j)} \right) \sigma_*^2 \le \sigma_*^2$

by applying the Cauchy-Schwartz inequality, and using the
definition of $\sigma_*$. Thus, $\|x\|_A = \max_{y : \|y\|_2 = 1} \|A D_x y\|_2 \le \sigma_* \le \sqrt{2} m^{-1/2}$.
If $m \ge \frac{72}{\epsilon^2} \log(1/\delta)$, we have that
$\|x\|_A \le \frac{\epsilon}{6 \sqrt{\log(1/\delta)}}$, which satisfies the conditions of
Lemma 2 from (Liberty et al., 2008). Thus applying the
above result from Lemma 2 (Liberty et al., 2008) to $x$, and
using Lemma 8, we have that $\Pr[ | \|A D_s x\|_2 - 1 | \ge \epsilon ] \le \delta$
and hence

$\Pr[ | \|x\|_\phi^2 - 1 | \ge \epsilon ] \le \delta.$

By taking union over the two error probabilities of Lemma
2 and Lemma 8, we have the result. ■

C. Inner Product 

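Proof [Corollary 4] We have that $2\langle x, x' \rangle_\phi = \|x\|_\phi^2 + \|x'\|_\phi^2 - \|x - x'\|_\phi^2$. Taking expectations, we have the standard inner product identity $2\langle x, x' \rangle = \|x\|_2^2 + \|x'\|_2^2 - \|x - x'\|_2^2$. Thus, by the triangle inequality,

$$|2\langle x, x' \rangle_\phi - 2\langle x, x' \rangle| \le |\|x\|_\phi^2 - \|x\|_2^2| + |\|x'\|_\phi^2 - \|x'\|_2^2| + |\|x - x'\|_\phi^2 - \|x - x'\|_2^2|.$$

Using the union bound, with probability $1 - 3\delta$, each of the terms above is bounded using Theorem 3. Thus, putting the bounds together, we have that, with probability $1 - 3\delta$,

$$|2\langle x, x' \rangle_\phi - 2\langle x, x' \rangle| \le \epsilon \left( \|x\|_2^2 + \|x'\|_2^2 + \|x - x'\|_2^2 \right).$$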
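The decomposition used in this proof holds for every fixed hash function, because the hashed feature map is linear. The sketch below is a hedged illustration under assumed sizes and names (`hashed`, `d`, `m` are not from the paper): it checks that the hashed inner product satisfies the same polarization identity as the ordinary one, and that the deviation of the inner product is bounded by the sum of the three norm deviations.

```python
import numpy as np

def hashed(x, h, xi, m):
    """Signed feature hashing: phi(x)_i = sum over j with h(j) == i of xi_j * x_j."""
    return np.bincount(h, weights=xi * x, minlength=m)

def sq_norm(v):
    """Squared Euclidean norm."""
    return float(v @ v)

rng = np.random.default_rng(2)
d, m = 500, 32                          # illustrative sizes, not from the paper
h = rng.integers(0, m, size=d)
xi = rng.choice([-1.0, 1.0], size=d)
x, xp = rng.standard_normal(d), rng.standard_normal(d)

px, pxp = hashed(x, h, xi, m), hashed(xp, h, xi, m)

# Polarization identity in the hashed space (phi is linear, so phi(x) - phi(x') = phi(x - x')).
inner_phi = float(px @ pxp)
polar_phi = sq_norm(px) + sq_norm(pxp) - sq_norm(px - pxp)

# Left-hand side and right-hand side of the triangle-inequality bound.
lhs = abs(2 * inner_phi - 2 * float(x @ xp))
rhs = (abs(sq_norm(px) - sq_norm(x))
       + abs(sq_norm(pxp) - sq_norm(xp))
       + abs(sq_norm(px - pxp) - sq_norm(x - xp)))

print(lhs, rhs)
```

The probabilistic part of the corollary only enters when each of the three terms on the right is bounded via Theorem 3; the inequality checked here is deterministic.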



D. Refutation of the Previous Incorrect Proof 

There were a few bugs in the previous version of the paper (Weinberger et al., 2009). We now detail each of them and illustrate why it was an error. The current result shows that using hashing we can create a projection matrix that preserves distances to a factor of $(1 \pm \epsilon)$ for vectors with a bounded $\|x\|_\infty / \|x\|_2$ ratio. The constraint on input vectors can be circumvented by multiple hashing, as outlined in Section 3.2, but that would require hashing $O(1/\epsilon^2)$ times. Recent work (Dasgupta et al., 2010) suggests that better theoretical bounds can be shown for this construction. We thank Tamas Sarlos and Ravi Kumar for the following writeup on the errors and for suggesting the new proof in Appendix B.

1. The statement of the main theorem in Weinberger et al. (Weinberger et al., 2009, Theorem 3) is false as it contradicts the lower bound of Alon (Alon, 2003). The flaw lies in the probability of error in (Weinberger et al., 2009, Theorem 3), which was claimed to be $\exp(-\sqrt{\epsilon}/(4\eta))$. This error can be made arbitrarily small without increasing the embedding dimensionality $m$, simply by decreasing $\eta = \|x\|_\infty / \|x\|_2$, which in turn can be achieved by preprocessing the input vectors $x$. However, this contradicts Alon's lower bound on the embedding dimensionality. The details of this contradiction are best presented through (Weinberger et al., 2009, Corollary 5) as follows.

Set $m = 128$ and $\delta = 1/2$ and consider the vertices of the $n$-simplex in $\mathbb{R}^{n+1}$, i.e., $x_1 = (1, 0, \ldots, 0)$, $x_2 = (0, 1, 0, \ldots, 0)$, .... Let $P \in \mathbb{R}^{(n+1)c \times (n+1)}$ be the naive, replication based preconditioner, with replication parameter $c = 512 \log^2 n$ as defined in Section 2 of our submission or (Weinberger et al., 2009, Section 3.2). Therefore for all pairs $i \ne j$ we have that $\|Px_i - Px_j\|_\infty = 1/\sqrt{c}$ and that $\|Px_i - Px_j\|_2 = \sqrt{2}$. Hence we can apply (Weinberger et al., 2009, Corollary 5) to the set of vectors $Px_i$ with $\eta = 1/\sqrt{2c} = 1/(32 \log n)$; then the claimed approximation error is

$$\sqrt{2/m} + 64 \eta^2 \log^2 \frac{n}{2\delta} = \frac{1}{8} + \frac{1}{16} \le \frac{1}{4}.$$

If Corollary 5 were true, then it would follow that with probability at least $1/2$, the linear transformation $A = \phi \cdot P : \mathbb{R}^{n+1} \to \mathbb{R}^m$ distorts the pairwise distances of the above $n + 1$ vectors by at most a $1 \pm 1/4$ multiplicative factor. On the other hand, the lower bound of Alon shows that any such transformation $A$ must map to $\Omega(\log n)$ dimensions; see the remarks following Theorem 9.3 in (Alon, 2003) and set $\epsilon = 1/4$ there. This clearly contradicts $m = 128$ above.

2. The proof of Theorem 3 contained a fatal, unfixable error. Recall that $\delta_{ij}$ denotes the usual Kronecker symbol, and $h$ and $h'$ are hash functions. Weinberger et al. make the following observation after equation (13) of their proof on page 8 in Appendix B.

"First note that $\sum_i \sum_j \delta_{h(j)i} + \delta_{h'(j)i}$ is at most $2t$, where $t = |\{j : h(j) \ne h'(j)\}|$."

The quoted observation is false. Let $d$ denote the dimension of the input. Then, $\sum_i \sum_j \delta_{h(j)i} + \delta_{h'(j)i} = \sum_j \big( \sum_i \delta_{h(j)i} + \delta_{h'(j)i} \big) = \sum_j 2 = 2d$, independent of the choice of the hash function. Note that $t$ played a crucial role in the proof of (Weinberger et al., 2009), relating the Euclidean approximation error of the dimensionality reduction to Talagrand's convex distance defined over the set of hash functions. Although the error is elementary, we do not see how to rectify its consequences in (Weinberger et al., 2009) even if the claim were of the right form.

3. The proof of Theorem 3 in (Weinberger et al., 2009) also contains a minor and fixable error. To see this, consider the sentence towards the end of the proof of Theorem 3 in (Weinberger et al., 2009), where $0 < \epsilon < 1$ and $\beta = \beta(x) \ge 1$.

"Noting that $s^2 = (\sqrt{\beta^2 + \epsilon} - \beta)/4\|x\|_\infty \ge \sqrt{\epsilon}/4\|x\|_\infty$, ..."

Here the authors wrongly assume that $\sqrt{\beta^2 + \epsilon} - \beta \ge \sqrt{\epsilon}$ holds, whereas the truth is $\sqrt{\beta^2 + \epsilon} - \beta \le \sqrt{\epsilon}$ always.

Observe that this glitch is easy to fix locally; however, the change is minor and the modified claim would still be false. Since for all $0 \le y \le 1$ we have that $\sqrt{1 + y} \ge 1 + y/3$, from $\beta \ge 1$ it follows (taking $y = \epsilon/\beta^2$) that $\sqrt{\beta^2 + \epsilon} - \beta \ge \epsilon/(3\beta)$. Plugging the latter estimate into the "proof" of Theorem 3 would result in a modified claim where the original probability of error, $\exp(-\sqrt{\epsilon}/(4\eta))$, is replaced with a weaker bound. Updating the numeric constants in the first section of this note would show that the new claim still contradicts Alon's lower bound. To justify this, observe that the counterexample is based on a constant $\epsilon$ and the modified claim would still lack the necessary $\Omega(\log n)$ dependency in its target dimensionality.
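The three points above can be sanity-checked numerically. The sketch below is not part of the original note and rests on stated assumptions: logarithms are taken base 2, the replication preconditioner repeats each coordinate $c$ times and rescales by $1/\sqrt{c}$, the Corollary 5 error expression is read as $\sqrt{2/m} + 64\eta^2 \log^2(n/(2\delta))$, and the lower bound $\epsilon/(3\beta)$ is the one obtained by substituting $y = \epsilon/\beta^2$ into $\sqrt{1+y} \ge 1 + y/3$. The value $n = 4$ and all function and variable names are illustrative.

```python
import math
import numpy as np

# --- Point 1: parameters of the simplex counterexample ----------------------
def replicate(x, c):
    """Assumed replication preconditioner: repeat each coordinate c times, scale by 1/sqrt(c)."""
    return np.repeat(x, c) / math.sqrt(c)

n, m, delta = 4, 128, 0.5
log_n = math.log2(n)                    # assumption: base-2 logarithm
c = int(512 * log_n ** 2)               # replication parameter c = 512 log^2 n
X = np.eye(n + 1)                       # rows are the vertices x_1, ..., x_{n+1}
diff = replicate(X[0], c) - replicate(X[1], c)
eta = float(np.max(np.abs(diff)) / np.linalg.norm(diff))
# Error expression assumed from (Weinberger et al., 2009, Corollary 5).
claimed_error = math.sqrt(2 / m) + 64 * eta ** 2 * math.log2(n / (2 * delta)) ** 2

# --- Point 2: the double sum equals 2d for every pair of hash functions -----
def double_sum(h, h_prime, buckets):
    """Sum over buckets i and indices j of [h(j) == i] + [h'(j) == i]."""
    i = np.arange(buckets)[:, None]
    hits = (h[None, :] == i).astype(int) + (h_prime[None, :] == i).astype(int)
    return int(hits.sum())

rng = np.random.default_rng(0)
d, buckets = 50, 8
h = rng.integers(0, buckets, size=d)
h_prime = rng.integers(0, buckets, size=d)
t = int(np.sum(h != h_prime))
total = double_sum(h, h_prime, buckets)
total_identical = double_sum(h, h, buckets)   # here t = 0, yet the sum is still 2d

# --- Point 3: elementary inequalities on a grid of (beta, epsilon) ----------
beta = np.linspace(1.0, 50.0, 200)[:, None]
eps = np.linspace(1e-3, 0.999, 200)[None, :]
gap = np.sqrt(beta ** 2 + eps) - beta
upper_holds = bool(np.all(gap <= np.sqrt(eps)))        # gap <= sqrt(eps)
lower_holds = bool(np.all(gap >= eps / (3 * beta)))    # gap >= eps / (3 beta)

print(c, eta, claimed_error, t, total, total_identical, upper_holds, lower_holds)
```

With identical hash functions the quoted bound $2t$ would be zero while the sum is $2d$, which is the simplest instance of the failure described in point 2.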



